
Implement Spark decimal multiply and divide #4613

Closed

Conversation

@jinchengchenghh (Contributor) commented Apr 14, 2023

Use the Arrow Gandiva BasicDecimal128 algorithm to compute the value. Introduce an intermediate int256_t value to avoid overflow during computation.

Add a dependency on Boost multiprecision so we can #include <boost/multiprecision/cpp_int.hpp>, which defines int256_t.

Arrow implementation:

Resolves: #4773
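For reference, a minimal sketch of the int256_t-widening idea using boost::multiprecision (illustrative only; the actual implementation follows Gandiva's BasicDecimal128 and also handles rescaling and rounding):

  #include <boost/multiprecision/cpp_int.hpp>

  using int128_t = __int128_t;
  using int256_t = boost::multiprecision::int256_t;

  // Multiply two unscaled decimal values in 256-bit space so the raw product
  // cannot overflow, then check whether the result still fits into the int128
  // that backs a long decimal. Names here are illustrative.
  int128_t multiplyWithOverflowCheck(int128_t a, int128_t b, bool& overflow) {
    const int256_t wide = int256_t(a) * int256_t(b);
    const int256_t kMax = (int256_t(1) << 127) - 1;
    const int256_t kMin = -(int256_t(1) << 127);
    overflow = wide > kMax || wide < kMin;
    return overflow ? 0 : wide.convert_to<int128_t>();
  }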

"min(38, a_precision - a_scale + b_scale + max(6, a_scale + b_precision + 1))")
.integerVariable(
"r_scale",
"min(37, max(6, a_scale + b_precision + 1))") // if precision is
Contributor Author

Usually we get the result type from the function signature constraint, but Spark decimal arithmetic is very complex, and I'm not sure we can represent the computation as a string. If we can, can anyone guide me? Or could we pass a function to the signature and compute the result type there?
Right now I compute the result type a second time, which is not a good solution, and it causes functions that invoke this one, such as makeCheckOverflow, to get a wrong input argument type.

In Spark, the decimal arithmetic form is check_overflow(a * b); we use the argument type inside the function's apply, so it is correct again after this function.

Contributor Author

Do you have any suggestions here? @mbasmanova
You could review just the function signature; the other parts will be refactored after #4434 is merged.

Contributor

@jinchengchenghh

> spark decimal arithmetic is very complex,

Would you share the details?

Contributor Author

This is the Spark decimal result type computation: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala#L604

I don't know how to express the if/else logic in a string constraint.
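For reference, the adjustment Spark applies when allowPrecisionLoss is true (DecimalType.adjustPrecisionScale) looks roughly like this; a C++ transcription, shown only as a sketch:

  #include <algorithm>
  #include <utility>

  // Roughly mirrors Spark's DecimalType.adjustPrecisionScale for
  // allowPrecisionLoss = true: cap precision at 38 and, when digits must be
  // dropped, keep at least min(scale, 6) fractional digits so the integral
  // part is not truncated.
  std::pair<int, int> adjustPrecisionScale(int precision, int scale) {
    constexpr int kMaxPrecision = 38;
    constexpr int kMinAdjustedScale = 6;
    if (precision <= kMaxPrecision) {
      return {precision, scale};
    }
    const int intDigits = precision - scale;
    const int minScale = std::min(scale, kMinAdjustedScale);
    const int adjustedScale = std::max(kMaxPrecision - intDigits, minScale);
    return {kMaxPrecision, adjustedScale};
  }

This if/else is what the ternary operator added to the constraint grammar later in this thread is meant to express.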

Contributor

@jinchengchenghh Is allowPrecisionLoss a session property? I imagine we would need to define 2 separate functions, one for allowPrecisionLoss=true and another for allowPrecisionLoss=false, and choose the right kind based on the value of allowPrecisionLoss.

  // When `spark.sql.decimalOperations.allowPrecisionLoss` is set to true, if the precision / scale
  // needed are out of the range of available values, the scale is reduced up to 6, in order to
  // prevent the truncation of the integer part of the decimals.
  protected def allowPrecisionLoss: Boolean = SQLConf.get.decimalOperationsAllowPrecisionLoss

Contributor Author

Yes, it defaults to true; the current implementation is based on the default value.

  val DECIMAL_OPERATIONS_ALLOW_PREC_LOSS =
    buildConf("spark.sql.decimalOperations.allowPrecisionLoss")
      .internal()
      .doc("When true (default), establishing the result type of an arithmetic operation " +
        "happens according to Hive behavior and SQL ANSI 2011 specification, i.e. rounding the " +
        "decimal part of the result if an exact representation is not possible. Otherwise, NULL " +
        "is returned in those cases, as previously.")
      .version("2.3.1")
      .booleanConf
      .createWithDefault(true)

allowPrecisionLoss is a rarely used config; someone interested in it can add the new function for allowPrecisionLoss=false, perhaps in another PR.

Contributor Author

Do you have suggestions on how to generate the result type with a function like adjustPrecisionScale? I don't have much context on the string constraint, which is then evaluated by expression::calculation::evaluate. @mbasmanova

Contributor

@jinchengchenghh I'm not sure I understand this question. It might be better to create a GitHub issue and move this discussion over there. Please, CC: @majetideepak and @karteekmurthys

@majetideepak (Collaborator)
@jinchengchenghh there is an overhaul of the Velox Type implementation underway.
See the discussion in #4069.
As a result, the Decimal Type implementation will be changing with #4434.
This work will have to be refactored to follow the new implementation.

@jinchengchenghh (Contributor Author)
@majetideepak OK, I will refactor after it merges

@jinchengchenghh (Contributor Author)
Looks like this Boost version cannot include /usr/local/include/boost/multiprecision/cpp_int.hpp.
I can build successfully locally with Boost version 1.71.0.0ubuntu2.
Can you guide me on how to fix this problem in the CI environment? @majetideepak

My build error is:

velox/functions/sparksql/CMakeFiles/velox_functions_spark.dir/DecimalArithmetic.cpp.o -MF velox/functions/sparksql/CMakeFiles/velox_functions_spark.dir/DecimalArithmetic.cpp.o.d -o velox/functions/sparksql/CMakeFiles/velox_functions_spark.dir/DecimalArithmetic.cpp.o -c ../../velox/functions/sparksql/DecimalArithmetic.cpp
In file included from /usr/local/include/boost/multiprecision/detail/generic_interconvert.hpp:9,
                 from /usr/local/include/boost/multiprecision/number.hpp:26,
                 from /usr/local/include/boost/multiprecision/cpp_int.hpp:12,
                 from ../.././velox/functions/sparksql/DecimalUtil.h:19,
                 from ../../velox/functions/sparksql/DecimalArithmetic.cpp:20:
/usr/local/include/boost/multiprecision/detail/default_ops.hpp:910:17: error: 'iterator_range' in namespace 'boost::mpl' does not name a template type
  910 |    typedef mpl::iterator_range<start_seq, typename mpl::end<list_type>::type> range;
      |                 ^~~~~~~~~~~~~~
/usr/local/include/boost/multiprecision/detail/default_ops.hpp:914:17: error: template argument 1 is invalid
  914 |        pred_type>::type iter_type;
      |                 ^
/usr/local/include/boost/multiprecision/detail/default_ops.hpp:914:18: error: expected identifier before '::' token
  914 |        pred_type>::type iter_type;
      |                  ^~
/usr/local/include/boost/multiprecision/detail/default_ops.hpp:914:20: error: typedef name may not be a nested-name-specifier

@jinchengchenghh (Contributor Author)
My build command is make debug EXTRA_CMAKE_FLAGS="-DVELOX_ENABLE_ARROW=ON".

It succeeds in my local environment, but raises the following exception in CI. Can you help me figure out how to fix this issue? @mbasmanova

boost::mpl::list exists in boost/mpl/aux_/preprocessed/plain/list.hpp, but I don't think I can include it directly, because it does not have an include guard such as #ifndef BOOST_MPL_ITERATOR_RANGE_HPP_INCLUDED.

velox/functions/sparksql/CMakeFiles/velox_functions_spark.dir/DecimalArithmetic.cpp.o -MF velox/functions/sparksql/CMakeFiles/velox_functions_spark.dir/DecimalArithmetic.cpp.o.d -o velox/functions/sparksql/CMakeFiles/velox_functions_spark.dir/DecimalArithmetic.cpp.o -c ../../velox/functions/sparksql/DecimalArithmetic.cpp
In file included from ../.././velox/functions/sparksql/DecimalUtil.h:25,
                 from ../../velox/functions/sparksql/DecimalArithmetic.cpp:20:
/usr/local/include/boost/multiprecision/cpp_int.hpp:1224:13: error: 'list' is not a member of 'boost::mpl'
 1224 |        mpl::list<
      |             ^~~~
/usr/local/include/boost/multiprecision/cpp_int.hpp:1224:13: note: suggested alternatives:
In file included from /opt/rh/gcc-toolset-9/root/usr/include/c++/9/list:63,
                 from _deps/folly-src/folly/system/AtFork.h:21,
                 from _deps/folly-src/folly/detail/ThreadLocalDetail.h:41,
                 from _deps/folly-src/folly/ThreadLocal.h:52,
                 from _deps/folly-src/folly/executors/GlobalThreadPoolList.h:25,
                 from _deps/folly-src/folly/executors/ThreadPoolExecutor.h:28,
                 from _deps/folly-src/folly/executors/CPUThreadPoolExecutor.h:24,

@mbasmanova (Contributor)
CC: @majetideepak @kgpai

Deepak, Krishna, any chance you can help here?

@kgpai (Contributor) commented Jun 5, 2023

@mbasmanova Taking a look.

@kgpai (Contributor) commented Jun 5, 2023

@jinchengchenghh Can you give more details on your local env? Are you on Linux using gcc?
One way to replicate this locally is to:

  1. Install docker
  2. Pull the docker image used for CI (docker pull ghcr.io/facebookincubator/velox-dev:circleci-avx )
  3. Copy over velox source code to that container (docker cp /path/to/your/velox /path/in/container)
  4. Then replicate the build failure there and subsequently fix.

Feel free to reach out to me on Slack (@kpai) if you have any questions or need assistance.

@jinchengchenghh (Contributor Author)
Thank you very much. I found that in my local environment BoostSource is AUTO, but in CI it is BUNDLED. I updated the Boost build script, and this problem has been fixed. @mbasmanova @kgpai

@jinchengchenghh jinchengchenghh marked this pull request as draft June 6, 2023 05:42
@jinchengchenghh jinchengchenghh marked this pull request as ready for review June 6, 2023 06:56
@jinchengchenghh (Contributor Author)
Can you help review? Thanks! @kgpai

uint64_t lo = static_cast<uint64_t>(word);
return (hi == 0) ? 64 + __builtin_clzll(lo) : __builtin_clzll(hi);
} else {
return -1;
Collaborator

throw VELOX_UNSUPPORTED instead of returning -1?
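A sketch of that suggestion (assuming the surrounding template from the diff; VELOX_UNSUPPORTED is the standard Velox error macro, and the exact code in the PR may differ):

  #include <cstdint>
  #include <type_traits>

  #include "velox/common/base/Exceptions.h"

  // Sketch only: handle uint64_t and uint128_t inputs and throw on anything
  // else instead of returning -1, as suggested in the review.
  // Assumes word != 0 (__builtin_clzll(0) is undefined).
  template <typename T = uint64_t>
  inline int32_t countLeadingZeros(T word) {
    if constexpr (std::is_same_v<T, uint64_t>) {
      return __builtin_clzll(word);
    } else if constexpr (std::is_same_v<T, __uint128_t>) {
      const auto hi = static_cast<uint64_t>(word >> 64);
      const auto lo = static_cast<uint64_t>(word);
      return (hi == 0) ? 64 + __builtin_clzll(lo) : __builtin_clzll(hi);
    } else {
      VELOX_UNSUPPORTED("countLeadingZeros supports only uint64_t and uint128_t");
    }
  }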

@majetideepak (Collaborator)
@jinchengchenghh Can we converge the existing DecimalUtil functionality with this? Looks like this is an improvement and could benefit Presto as well.

@jinchengchenghh (Contributor Author)
sparksql::DecimalUtil::divideWithRoundUp can divide without overflow in more cases, I'm not sure if this behavior complies with Presto. Presto may expect overflow in such cases? @majetideepak

@jinchengchenghh (Contributor Author) commented Jun 12, 2023

Can you help review? Thanks! @mbasmanova

@majetideepak (Collaborator)
> sparksql::DecimalUtil::divideWithRoundUp can divide without overflow in more cases, I'm not sure if this behavior complies with Presto. Presto may expect overflow in such cases? @majetideepak

Presto Java uses the BigInteger class for decimals. I believe it does not overflow easily either, similar to Spark.
@karteekmurthys any thoughts on this?

@jinchengchenghh (Contributor Author) commented Jun 26, 2023

maxBitsRequiredAfterScaling will check whether it needs to widen to int256 to guarantee no overflow, but efficiency will be lower than before since it cannot get the accurate bits required after scaling. Can we accept the performance regression? If yes, I will refactor this into the current DecimalUtil.h. @majetideepak @karteekmurthys @mbasmanova

@karteekmurthys (Contributor)
> sparksql::DecimalUtil::divideWithRoundUp can divide without overflow in more cases, I'm not sure if this behavior complies with Presto. Presto may expect overflow in such cases? @majetideepak
>
> Presto Java uses the BigInteger class for decimals. I believe it does not overflow easily as well similar to spark. @karteekmurthys any thoughts on this?

That is true. It uses BigInteger support and doesn't overflow that easily. Our decimal arithmetic operations do verify whether the results overflow decimal limits and throw exceptions if they do. You can refer to the unit tests and e2e tests for that.

return makeLongDecimalFlatVector(int128s, DECIMAL(precision, scale));
}

int128_t convertStringToInt128(const std::string& value, bool& nullOutput) {
Contributor

Use int128_t HugeInt::parse(const std::string& str)
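For example (using the HugeInt::parse API named above; shown only as an illustration of the suggested change), the helper call in the test hunk below could become:

  // Instead of convertStringToInt128("201" + std::string(31, '0'), nullOutput):
  const int128_t value = HugeInt::parse("201" + std::string(31, '0'));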

bool nullOutput;
testDecimalExpr<TypeKind::HUGEINT>(
makeLongDecimalFlatVector(
{convertStringToInt128("201" + std::string(31, '0'), nullOutput)},
Contributor

Or you could replace it with int128_t build(uint64_t hi, uint64_t lo) from Hugeint.h.

I understand that the string form is more readable but slower.

Contributor

In tests, we should favor readability as top performance is not important.

int256_t inAbs = abs(in);
bool isNegative = in < 0;

uint128_t unsignResult = (inAbs & uint128Mask).convert_to<uint128_t>();
Contributor
If possible, these two can be merged.

Contributor Author

No, T is int64_t while the variable type uses uint64_t; we cannot get a template UT as uint64_t.

@karteekmurthys (Contributor) Jun 27, 2023

Does it require templating? You are using specific types in the functions, so you could simply overload the functions.
Or maybe you can still merge convert<int64_t> and convert<int128_t>.

Do

typedef typename std::conditional<std::is_same_v<TInput, int64_t>,uint64_t,__uint128_t>::type UT;

This way you can switch the unsigned type depending on T.
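A sketch of the merged helper along those lines (the point is the std::conditional type selection; the actual masking and overflow handling in the PR differ):

  #include <boost/multiprecision/cpp_int.hpp>
  #include <cstdint>
  #include <type_traits>

  using int128_t = __int128_t;
  using uint128_t = __uint128_t;
  using int256_t = boost::multiprecision::int256_t;

  // One templated convert covering both int64_t and int128_t outputs; the
  // unsigned companion type is chosen via std::conditional, as suggested.
  template <typename TInput>
  TInput convert(int256_t in, bool& overflow) {
    using UT = typename std::
        conditional<std::is_same_v<TInput, int64_t>, uint64_t, uint128_t>::type;
    constexpr int kBits = sizeof(UT) * 8;

    const int256_t inAbs = abs(in);
    const bool isNegative = in < 0;

    // |in| must fit the signed range of TInput (the minimum-value edge case
    // is ignored in this sketch).
    overflow = (inAbs >> (kBits - 1)) != 0;
    if (overflow) {
      return 0;
    }
    const UT unsignedResult = inAbs.convert_to<UT>();
    return isNegative ? -static_cast<TInput>(unsignedResult)
                      : static_cast<TInput>(unsignedResult);
  }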

Contributor Author

Thanks for your guidance, fixed.

@jinchengchenghh jinchengchenghh force-pushed the decimal_arith branch 2 times, most recently from 3460bbb to bc0363a Compare June 28, 2023 05:12
@jinchengchenghh (Contributor Author)
Do you have further comments? @karteekmurthys

@mbasmanova (Contributor) left a comment

@jinchengchenghh Thank you for the contribution.

> Use Arrow gandiva BasicDecimal128 algorism to compute value, it will introduce int256_t to get much bigger result scale than presto.
> For some corner cases, it is an approximate value to Spark.

This description is hard to understand as it is lacking details. What is the "Arrow gandiva BasicDecimal128 algorism"? Would you include a link and/or summary of the algorithm?

> it will introduce int256_t to get much bigger result scale than presto.

What is "much bigger result scale"? Would you clarify?

> For some corner cases, it is an approximate value to Spark.

What are these corner cases? Would you provide examples and a description of these cases?

@@ -22,6 +22,14 @@ var ([[:alpha:]][[:alnum:]_]*)
"=" return Parser::token::ASSIGN;
"min" return Parser::token::MIN;
"max" return Parser::token::MAX;
"<" return Parser::token::LT;
Contributor

What are these changes for? It would be nice to extract these into a separate PR and add a test.

Contributor Author

This enhances the function signature support, so @majetideepak suggested I add it in this PR: #4773 (comment)

Are you sure we should extract these into a separate PR?

Contributor

Hmm... Can you describe these changes in the PR description? Are there tests for these? Can you give me a pointer?

Contributor Author

https://github.com/facebookincubator/velox/pull/4613/files#diff-4baee9f82f9c347f753972415d345941d2c2a110b608d480b090650171da096bR379
It is used to compute the result scale in the getResultScale function:

std::string getResultScale(std::string precision, std::string scale) {
  return fmt::format(
      "({}) <= 38 ? ({}) : max(({}) - ({}) + 38, min(({}), 6))",
      precision,
      scale,
      scale,
      precision,
      scale);
}

The current decimal multiply and divide tests cover the <= and ?: changes. I also extended the calculation to support basic operators like <.

If the computation is wrong, we cannot get the correct result type.
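For illustration, if the multiply result type before adjustment has precision a_precision + b_precision + 1 and scale a_scale + b_scale (per the Spark rules discussed above), getResultScale expands to a constraint string like the following (line breaks added for readability; the exact expressions used in the PR may differ):

  (a_precision + b_precision + 1) <= 38
      ? (a_scale + b_scale)
      : max((a_scale + b_scale) - (a_precision + b_precision + 1) + 38,
            min((a_scale + b_scale), 6))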

Contributor

Got it. Let's extract these changes into a separate PR and add a test to velox/expression/tests/SignatureBinderTest.cpp to cover all of added functionality.

@@ -52,7 +52,9 @@ set(BOOST_HEADER_ONLY
crc
circular_buffer
math
mpl
Contributor

Are you adding new dependencies to Velox? Please update the PR description to clearly state that and explain why they are needed.

Contributor Author

Here I include this header file, so we need it: https://github.com/facebookincubator/velox/pull/4613/files#diff-4b21d26191cbbdb9bc90cafd83aed144a4483b1c594f224cf3bbe67f401f5f83R19
#include <boost/multiprecision/cpp_int.hpp>
It introduces int256_t.

inline int32_t countLeadingZeros(uint64_t word) {
return __builtin_clzll(word);
template <typename T = uint64_t>
inline int32_t countLeadingZeros(T word) {
Contributor

It would be nice to extract this change into a separate PR. This would make this PR smaller and easier to review.

@@ -57,6 +57,12 @@ Mathematical Functions
SELECT 2L / 2L; -- 1.0
SELECT 3 / 0; -- NULL

.. spark:function:: divide(x, y) -> decimal

Returns the results of dividing x by y. The types of x and y must be decimal.
Contributor

Would you update the docs for the existing "divide(x, y) -> double" to clarify what types of x and y it supports?

Can x and y be decimals with different precision / scale? If so, can you make this clear? What would be the precision / scale of the result? Please clarify.

using uint128_t = __uint128_t;
using int256_t = boost::multiprecision::int256_t;

class DecimalUtil {
Contributor

Please, document this class and its public methods.

// bits_required(x * 10^y) <= bits_required(x) + floor(log2(10^y)) + 1
// We precompute floor(log2(10^x)) + 1 for x = 0, 1, 2...75, 76

static const int32_t floorLog2PlusOne[] = {
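For context, a minimal sketch of how such a table bounds the bits needed after scaling by a power of ten (names and table length here are illustrative; the PR's table extends to 76):

  #include <cassert>
  #include <cstdint>

  using int128_t = __int128_t;
  using uint128_t = __uint128_t;

  // floor(log2(10^x)) + 1 for x = 1..38 (index 0 is 0: scaling by 10^0 adds
  // no bits).
  static const int32_t kFloorLog2PlusOne[] = {
      0,  4,  7,  10, 14, 17, 20, 24, 27, 30, 34, 37, 40,
      44, 47, 50, 54, 57, 60, 64, 67, 70, 74, 77, 80, 84,
      87, 90, 94, 97, 100, 103, 107, 110, 113, 117, 120, 123, 127};

  // Number of significant bits in |value|.
  int32_t bitsRequired(int128_t value) {
    uint128_t abs = value < 0 ? -static_cast<uint128_t>(value)
                              : static_cast<uint128_t>(value);
    int32_t bits = 0;
    while (abs != 0) {
      abs >>= 1;
      ++bits;
    }
    return bits;
  }

  // Upper bound on the bits needed for value * 10^scaleBy, via
  // bits_required(x * 10^y) <= bits_required(x) + floor(log2(10^y)) + 1.
  // If the bound exceeds 127, the computation must widen to int256_t.
  int32_t maxBitsRequiredAfterScaling(int128_t value, int32_t scaleBy) {
    assert(scaleBy >= 0 && scaleBy <= 38);
    return bitsRequired(value) + kFloorLog2PlusOne[scaleBy];
  }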
Contributor

Please, check coding style document and update the PR to comply. Here, the naming needs to be updated.

@jinchengchenghh jinchengchenghh force-pushed the decimal_arith branch 5 times, most recently from e83dd19 to 6a73c7b Compare August 24, 2023 14:19
@jinchengchenghh (Contributor Author)
Do you have further comments? @mbasmanova

@mbasmanova (Contributor) left a comment

@Yuhta Jimmy, would you help review and merge this PR?

@mbasmanova mbasmanova requested a review from Yuhta August 24, 2023 20:00
@jinchengchenghh jinchengchenghh force-pushed the decimal_arith branch 3 times, most recently from 718a8f1 to 90c84de Compare September 11, 2023 02:58
@jinchengchenghh
Copy link
Contributor Author

Do you have further comments? Thanks! @Yuhta

@zhouyuan (Contributor)
@Yuhta gentle ping, this patch is the basis of the decimal-related functions for Spark SQL.

thanks, -yuan

@Yuhta (Contributor) left a comment

@jinchengchenghh Looks good overall. Can you rebase and run the new functions through the Spark function fuzzer to make sure there are no corner-case bugs?

@jinchengchenghh (Contributor Author)
Rebased and passed the Spark fuzzer test; can you help merge it? Thanks! @Yuhta

@facebook-github-bot
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
@kgpai merged this pull request in 68b4698.

@conbench-facebook

Conbench analyzed the 1 benchmark run on commit 68b4698c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.
